{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/experimental/semantic-search-intro/levenshtein.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/experimental/semantic-search-intro/levenshtein.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Levenshtein Distance\n",
    "\n",
    "In this notebook we'll explore Levenshtein distance, and how we can use it to compare strings.\n",
    "\n",
    "We will be using a Wagner-Fischer matrix to calculate our Levenshtein distance, let's write a function that will perform this operation for us given two strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def leven(a, b):\n",
    "    # we must add an additional character at the start of each string\n",
    "    a = f' {a}'\n",
    "    b = f' {b}'\n",
    "    # initialize empty zero array\n",
    "    lev = np.zeros((len(a), len(b)))\n",
    "    # now begin iterating through each value, finding the best path\n",
    "    for i in range(len(a)):\n",
    "        for j in range(len(b)):\n",
    "            if min([i, j]) == 0:\n",
    "                lev[i, j] = max([i, j])\n",
    "            else:\n",
    "                # calculate our three possible operations\n",
    "                x = lev[i-1, j]  # deletion\n",
    "                y = lev[i, j-1]  # insertion\n",
    "                z = lev[i-1, j-1]  # substitution\n",
    "                # take the minimum (eg best path/operation)\n",
    "                lev[i, j] = min([x, y, z])\n",
    "                # and if our two current characters don't match, add 1\n",
    "                if a[i] != b[j]:\n",
    "                    # if we have a match, don't add 1\n",
    "                    lev[i, j] += 1\n",
    "    return lev, lev[-1, -1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's test this on some strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.],\n",
       "        [ 1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.],\n",
       "        [ 2.,  1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.],\n",
       "        [ 3.,  2.,  1.,  0.,  0.,  1.,  2.,  3.,  4.,  5.,  5.,  6.],\n",
       "        [ 4.,  3.,  2.,  0.,  0.,  1.,  2.,  3.,  4.,  5.,  5.,  6.],\n",
       "        [ 5.,  4.,  3.,  1.,  1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.],\n",
       "        [ 6.,  5.,  4.,  2.,  2.,  1.,  0.,  1.,  2.,  3.,  4.,  5.],\n",
       "        [ 7.,  6.,  5.,  3.,  3.,  2.,  1.,  0.,  1.,  2.,  3.,  4.],\n",
       "        [ 8.,  7.,  6.,  4.,  4.,  2.,  2.,  1.,  1.,  2.,  3.,  4.],\n",
       "        [ 9.,  8.,  7.,  5.,  5.,  3.,  3.,  2.,  2.,  1.,  2.,  3.],\n",
       "        [10.,  9.,  8.,  5.,  5.,  4.,  4.,  3.,  3.,  2.,  1.,  2.],\n",
       "        [11., 10.,  9.,  6.,  6.,  5.,  5.,  4.,  4.,  3.,  2.,  1.]]),\n",
       " 1.0)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "leven('hello world', 'hello wurld')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we only need to make one change between our two strings, switching between `u` and `o` in `world`. The bottom-right value gives us our Levenshtein distance (the number of changes that must be made), which is equal to **1**.\n",
    "\n",
    "Let's try it with another."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.],\n",
       "        [ 1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.],\n",
       "        [ 2.,  1.,  0.,  1.,  2.,  3.,  4.,  5.,  5.,  6.],\n",
       "        [ 3.,  2.,  1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.],\n",
       "        [ 4.,  3.,  1.,  1.,  1.,  2.,  3.,  4.,  4.,  5.],\n",
       "        [ 5.,  4.,  2.,  2.,  2.,  1.,  2.,  3.,  4.,  4.],\n",
       "        [ 6.,  5.,  3.,  3.,  3.,  2.,  1.,  2.,  3.,  4.],\n",
       "        [ 7.,  6.,  4.,  4.,  4.,  3.,  2.,  2.,  3.,  4.],\n",
       "        [ 8.,  7.,  5.,  5.,  5.,  4.,  3.,  2.,  3.,  4.],\n",
       "        [ 9.,  8.,  5.,  6.,  6.,  5.,  4.,  3.,  2.,  3.],\n",
       "        [10.,  9.,  6.,  6.,  6.,  6.,  5.,  4.,  3.,  3.],\n",
       "        [11., 10.,  7.,  7.,  7.,  6.,  6.,  5.,  4.,  3.]]),\n",
       " 3.0)"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "leven('Levenshtein', 'Levinsten')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we find that we need **3** operations two switch between `'Levenshtein'` and `'Levinshten'`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13 (default, Mar 28 2022, 06:59:08) [MSC v.1916 64 bit (AMD64)]"
  },
  "orig_nbformat": 2,
  "vscode": {
   "interpreter": {
    "hash": "5fe10bf018ef3e697f9035d60bf60847932a12bface18908407fd371fe880db9"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
